extractions in a text and then categorize the data according to the SA 
techniques. Keeping the focus on Twitter data, the data is extracted in a domain-specific 
manner. In the data cleaning phase, noisy data, missing data, 
tokenization is performed which is followed by stop word removal (SWR). 

1. INTRODUCTION 

Sentiment analysis (SA) is a field of natural language processing (NLP) that analyzes the opinion, 
expression and attitude of a user towards an entity, which can be an individual, a place, an event, a product, 
an issue or a discussion. Markets, firms and organizations use sentiment analysis to attract and 
satisfy more customers. Political parties in different countries use SA data to gauge the satisfaction of 
citizens or to gather public feedback on their administration and policies. SA offers 
challenging opportunities to develop new applications in which the sentiments of customers, viewers or users play a 
vital role. A highly precise decision can take an organization to the top position among its 
competitors, while an imprecise one can make it fall from the roof [1]. With advancing technology and the increasing 
use of gadgets, sentiments are no longer obtained only from conventional feedback but may also take the form of 
audio, video, images, text, micro-blogs, tweets, posts and comments on social media, or emoticons. 

Analyzing such heterogeneous data in real time with an appreciable level of precision has always been 
a challenge. Most organizations use a third-party tool such as WorkForce or Glassdoor, or sometimes 
set up a separate cell to analyze this data, to obtain genuine sentiment analysis followed by 
accurate decision making. Inaccurate SA, or SA with a poor level of accuracy, may prove disastrous for any 
organization in the present time of cut-throat market competition. Many improvements in SA and its 
applications have been made in the recent past. 


2. LITERATURE REVIEW 

The information available on social media sites is often not well structured: it contains 
abbreviations, misspellings, emoticons, slang and HTML tags, which make 
sentiment analysis difficult [2], [3]. This type of data increases the dimensionality, which may lead to poor 
performance of sentiment analysis (SA) techniques [4]. To reduce the dimensionality, the data is passed through 
two phases, data cleaning and pre-processing, before an SA technique is applied. The accuracy 
of SA techniques depends largely on these phases. 

Many techniques have been used for data classification and data clustering. In most 
cases, hybrid methods are used for classification and clustering, depending on the problem at hand [5]. 
Classification techniques are a sub-category of supervised techniques, in which labelled data is 
used to train a model that then produces the outcome. The degree of accuracy is highly dependent on 
the accuracy of the training data [6]. A semi-supervised technique, regularized least squares (RLS), was 
introduced to represent unlabelled and labelled data on a bipartite graph in order to analyze the sentiment 
of documents as well as of words [7]. Blogs on enterprise software products, politics and movie 
reviews were considered, and the technique produced 90% accuracy. Using supervised techniques such as support 
vector machine (SVM), multinomial naïve Bayes (MNB) and maximum entropy (MaxEnt), Erik and Francine 
[8] worked on a multilingual unformatted dataset in English, Dutch and French and achieved 58% 
accuracy. In this technique, automated analysis offered potential benefits such as 
word-of-mouth marketing, real-time response and neutral fetching of information. The small size of the dataset 
did not help much in managing noise. A term-document matrix was employed for tri-factorization with 
simple update rules, where sentiment lexicons were used as the first set of constraints. The second set of 
constraints retained domain-specific supervision [9]. 

Blogs covering four different topics of discussion, i.e., software products, politics, Amazon product 
reviews and movie reviews, were studied. In a new approach, rule-based classification and machine 
learning approaches were coupled together and 50% accuracy was achieved [2]. A compact semi-supervised 
classifier was introduced in which classifiers were assigned according to the type of text. In this pipeline 
approach, 10-fold cross validation was performed, which resulted in higher efficiency but consumed more time. 

Li et al. [10] used SVM and NB techniques on reviews of books, DVDs, kitchen appliances 
and electronic items. The data was split into two categories, personal and impersonal text, as co-training data. 
This approach improved the baseline accuracy [11], reduced classification noise and needed no proper 
syntactical rules. The lack of labelled data was handled by dividing the imbalanced population into multiple sets 
of balanced populations for sentiment classification, and multiple iterations improved performance [12]. A new 
approach was provided for active learning in a multi-domain framework. The term frequency method was used 
to weigh the features along with LIBLINEAR SVM. This method was evaluated on spam mail filtering, 
newsgroup classification and sentiment classification, where human effort was reduced by 33.2%, 42.9% 
and 68.7% respectively. 

To classify the sentiments of micro-blogs, a method was proposed in which machine learning was 
combined with domain-specific techniques, and a system called Opinion Miner was introduced [13]. The 
precision of Opinion Miner stood at 96%. The lack of a stopping criterion to control iteration caused unnecessary data 
sampling. A numerical matrix representation was used for movie reviews, and positive or negative reviews 
were identified with 89.5% accuracy using NB. SVM raised the accuracy to 94% [1]. Sentiment analysis was 
also performed on reviews [14]. For the positive class, the performance of unigrams with stop words settled at 82.9% 
and that without stop words reached 83%. The performance was higher for the negative class. 

Based on deep learning parameters, a model was proposed to address implicit and explicit sentiment 
factors in text data using word embedding representations in Vietnamese and English [15]. The 
proposed model proved to be better than traditional machine learning methods and achieved up to 
87% in sentiment analysis on all available corpora. 

Clustering techniques are used to cluster data based on different parameters and to form groups 
as required for business analytics and better decision making [16]. A technique using k-means 
clustering was proposed to cluster documents in combination with a scoring technique [17]. The movie 
review dataset was examined with 77.17% accuracy. A spam filtering technique was developed 
based on the vector space model, using k-means text clustering and the balanced iterative reducing and clustering 
using hierarchies (BIRCH) technique [18]. K-means clustering provided better results for smaller datasets, while k-nearest 
neighbours (k-NN) and BIRCH were shown to be good for larger datasets. 

Venkatasubramanian et al. [19] proposed that stop words could be used for classification 
without employing syntactic processing. A semantic clustering algorithm, latent Dirichlet allocation (LDA), was 
applied and 66% was observed for the polarity of reviews. This method proved to be complementary to the 
syntactic approach to sentiment analysis. For product reviews, a semi-supervised technique was proposed in which 
words and phrases belonging to a similar domain were grouped under the same feature set [20]. The expectation 
maximization (EM) algorithm based on naïve Bayes was applied to five datasets, and the results were found to 
be superior to the 13 baselines that represented the current state-of-the-art solutions. A 
contextual multimodal method was presented to analyze visual, textual and audio cues of approximately 800 utterances 
in Persian, achieving a performance of around 91% [21]. 

An approach was suggested that was based on clustering with the term frequency-inverse document 
frequency (TF-IDF) weighting technique, a voting mechanism and an important-term score [22]. This approach 
was shown to be efficient, automated, accurate and faster than supervised learning techniques. Further 
modifications were introduced in the clustering technique so that it worked without prior knowledge of a training 
dataset, human intervention or linguistic knowledge [23]. This automated method increased the accuracy of the 
baseline up to 76%. 
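
As a rough illustration of the TF-IDF weighting mentioned above, the following sketch computes weights for a toy corpus. The corpus, the function name `tf_idf` and the particular variant (term frequency normalised by document length, natural-log inverse document frequency without smoothing) are assumptions made for illustration; the cited works may use a different variant.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus (hypothetical), already tokenized.
corpus = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie"],
    ["good", "phone"],
]
w = tf_idf(corpus)
```

Under this variant a term that occurs in every document receives a weight of zero, which is why TF-IDF tends to down-weight very common words.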

Suresh and Raj [24] presented an aspect-level method to find the sentiment towards a particular brand 
from its Twitter feed with the help of novel fuzzy clustering, and obtained an accuracy of 76.4% with faster 
execution. The combined effects of clustering and sentiment analysis were shown on review datasets 
[25]. It was found that the K-means clustering algorithm provided better results on balanced review 
datasets. The newly designed weighting system was shown to be better than traditional ones. Sentiments of 
movie reviews were analyzed by applying the Word2Vec and K-means++ algorithms [26]. It was 
argued that this approach could be used for sarcasm and question detection with additional modifications. 

A comparative study of the sentiment analysis approaches adopted by researchers is shown in 
Table 1, covering the domain and the languages for which each work was done. 
To extend the understanding, the pre-processing stages are taken into account, followed by 
the machine learning techniques used for analysis and the accuracy achieved so far. 


Table 1. Comparative analysis of approaches adopted by researchers 

| Ref | Domain | Language | Preprocessing | Classification and clustering algorithm | Limitations | Result |
|---|---|---|---|---|---|---|
| [27] | Labeled product reviews of four domains: books, DVDs, kitchen appliances, electronics | English | 1 Gram, 2 Gram, 1+2 Gram and 1 Gram + 2 Gram approach | Classifier level fusion and feature | The classifier level fusion approach faced unbalanced performance for multi-domain data | 80% |
| [8] | Blogs, forum text, reviews related to products | English, Dutch, French | Unigram, Bi Subjectivity, Bigram | Cascade learner classifier | Lack of training data; conflicting sentiments; lack of pattern detection | Approx. 70% |
| [28] | Tweets which contain some noise, associated with mobile operators | English | Normalized words | Annotated ensembles | Words with different spelling and representation cannot be mapped | 47% F-score |
| [26] | Movie reviews | English | Word2Vec | K-means/K-means++ | Could not enhance accuracy of baseline | - |
| [29] | Tweets | Spanish, English | Unigram, Bigram | Cascade classifiers | Normalization techniques and a slang dictionary can be included for Spanish | Approx. 69% |
| [30] | Tweets on Indian political parties | Hindi | Hashtag, URL, stopword removal, negation handling | Dictionary based, naive Bayes and SVM algorithm | Limited data size; emoticons are not included in the data set; only text data included | 78% |
| [31] | Travel destination reviews | Hindi, Marathi | POS | SVM | Concepts for the Marathi language were missing; accuracy can be enhanced by considering them | 72% |
| [32] | Social media text | Hinglish | Lowering case, lemmatization, multiword grouping | Convolution neural network (CNN), long short-term memory (LSTM), convolution neural network-bidirectional LSTM (CNN-BiLSTM) | Bilingual model required | 83% |

Researchers have worked on sentiment analysis in English as well as in native languages such as 
Chinese, Spanish, Marathi and Tamil. Work has also been done in bilingual and multilingual settings such as Hindi 
and Hindi-English combinations, Tamilish, and English+French. Many such combinations are used for 
other pairs of languages. However, combinations of more than two languages have not been worked upon 
extensively. Owing to difficulties such as ambiguous words, 
inconsistent spelling, part of speech and bag of words, sentiment analysis for the combination of 
Hindi-Hinglish-English languages has been missing from the literature. 

Researchers have mainly worked on social network data, blogs, product reviews, 
political reviews and subjective tweets. After the data is collected from multiple sources, it is 
pre-processed with commonly used pre-processing techniques. The most commonly used 
pre-processing steps are stop-word removal, lemmatization, 
stemming, lowercasing, part-of-speech tagging and normalization, which have shown a great effect on the 
accuracy of machine learning models. 

The processed data is input to machine learning models for sentiment analysis. For 
classification and clustering, researchers have used everything from the simplest algorithms such as naïve 
Bayes, SVM and K-means to cascade classifiers and neural network techniques. These techniques are 
mostly restricted to monolingual or bilingual settings, take only text data into account 
and are trained on small datasets. Since the accuracy of these techniques is 
restricted to the range of 70-80%, there is room to improve the accuracy level. The present paper 
is an attempt to highlight dataset creation from tweets fetched from Twitter based on hashtags. The 
pre-processing techniques used by researchers, the impact of these techniques and their comparative 
analysis are discussed in Table 1 (see in Appendix). 


3. METHOD 

Sentiment analysis techniques play a vital role in predicting human behaviour towards an organization, 
entity or product. In today’s world, where word-of-mouth, customer feedback, reviews and 
opinions have become major issues, sentiment analysis (SA) and opinion mining are the two techniques 
invariably used [33], [34]. Subjective extraction of opinion related to an entity falls under opinion 
mining, whereas in SA complete text analysis is performed [35]. SA consists of sentiment identification in a 
text followed by its analysis. The accuracy of decision making depends on the accuracy of sentiment analysis. 
The complete procedure for SA of multiple events and parallel-running sub-events on social media (SM), and their 
influence on the behaviour, reactions and even thoughts of people, has been discussed in [36]. The generalized 
process of performing SA on social networking data is as follows: 


3.1. The opinion of users 

The first step of SA is to retrieve the information (opinionated words and phrases) from the huge 
amount of data available and to store this information in the required format [37]. With the main concern being 
the influence of events and sub-events, data has been picked from Twitter using hashtag information 
to find events, and user mentions and retweets to find the sub-event count. Information is 
extracted and represented in a similar way in [38]. Figure 1 shows the word cloud for the extracted hashtags. 
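
The extraction step described above can be sketched with regular expressions. This is a minimal illustration rather than the authors' implementation: the sample tweets and user names are hypothetical, the function name `extract_entities` is an assumption, and a tweet is treated as a retweet simply when it starts with `RT @`.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def extract_entities(tweet):
    """Return hashtags, user mentions and a retweet flag for one tweet."""
    return {
        # Lowercase hashtags so that #Pulwama and #pulwama are counted together.
        "hashtags": [h.lower() for h in HASHTAG_RE.findall(tweet)],
        "mentions": MENTION_RE.findall(tweet),
        "is_retweet": tweet.startswith("RT @"),
    }


# Hypothetical tweets; only the hashtags are taken from the word cloud.
tweets = [
    "RT @user_a: Salute to the heroes #Pulwama #IndianBraves",
    "@user_b we will not forget #pulwama",
]
entities = [extract_entities(t) for t in tweets]
hashtag_counts = Counter(h for e in entities for h in e["hashtags"])
```

Counts of this kind are the usual input to a word cloud, where the display size of each hashtag grows with its frequency.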


[Word cloud; legible hashtags include IndianBraves, LestWeForgetIndia and Pulwama] 


Figure 1. Word cloud for extracted hashtags 


3.2. Data cleaning 

The data cleaning process makes the text noise-free. The noise in the data takes the form of missing 
words or unwanted words. These missing or unwanted words appear in the guise of symbols 
that are not handled by the code. Data cleaning comprises information filtering, removal of stop-words 
and punctuation, and tokenization [36]. Figure 2 gives a glimpse of the 
format of the data collected from Twitter and of the same data after the clean text function has been applied. After 
the cleaning process, the data is converted into lowercase and tokenized, and stopwords are removed; Figure 3 
represents the outcome of this process. 
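
A clean text function of the kind described above might look like the following sketch. The paper does not list its implementation, so the function body, the regular expressions and the order of the steps are assumptions. Note that the last substitution also drops non-Latin script, which may not be desirable for Hindi text written in Devanagari.

```python
import html
import re


def clean_text(tweet):
    """Strip retweet markers, URLs, HTML tags, mentions and symbols."""
    text = html.unescape(tweet)                   # decode entities such as &amp;
    text = re.sub(r"^RT\s+@\w+:\s*", "", text)    # leading retweet marker
    text = re.sub(r"https?://\S+", " ", text)     # URLs
    text = re.sub(r"<[^>]+>", " ", text)          # HTML tags
    text = re.sub(r"@\w+", " ", text)             # user mentions
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # punctuation, emoji, symbols
    return re.sub(r"\s+", " ", text).strip()      # collapse whitespace


raw = "RT @Sakthivelavan5: THANK YOU FOR SUBSCRIBE https://t.co/nBcYmVKmvE"
cleaned = clean_text(raw)
```

Lowercasing, tokenization and stop-word removal are left to the later stages so that each step can be inspected on its own.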


[Screenshot: raw tweets (with retweet markers, user mentions, URLs and HTML fragments) shown alongside their cleaned text] 

Figure 2. Clean text hashtags 


3.3. Data pre-processing 

Data pre-processing consists of tokenization, part-of-speech tagging, normalization, lemmatization and 
stemming of words, through which a network among words is established [39]. In this way, all the relevant events and 
sub-events are mapped onto connecting words; for example, "thanking you" and "thank you" can both be mapped onto 
"thank". Here the stemming algorithm is applied to the clean text. Figure 3 shows the stemmed dataset. 
Algorithm 1 represents the overall procedure used for text processing. 


Figure 3. Tokenized and stemmed data 

Algorithm 1. Text preprocessing algorithm 

for each tweet in Document do 

perform tokenization by splitting the text 
zonoring wY & wy 

end for 

for each token in Document do 
    tweet.remove_stopwords(english) 
    tweet.remove_punctuations 
    tweet.remove_colon_symbol 
    tweet.replace_non_ASCII_char_with_space 
    tweet.remove_emoticons 
end for 

for each remaining word in dataset do 
    perform stemming using stemmer and store in Vector (Word List) 
end for 
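
The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative approximation, not the authors' code: the stop-word list is a tiny hypothetical sample, `simple_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's, and the character-level filters are applied before tokenization for brevity.

```python
import string

# Tiny illustrative stop-word list; a real run would use a fuller list.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "for", "and", "you"}
SUFFIXES = ("ing", "ed", "s")  # crude stand-in for a real stemmer


def simple_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tweet):
    """Tokenize, filter and stem one tweet, loosely following Algorithm 1."""
    # Replace non-ASCII characters (including emoji) with spaces.
    ascii_text = "".join(ch if ord(ch) < 128 else " " for ch in tweet)
    # Replace punctuation with spaces; this also covers the colon symbol.
    no_punct = ascii_text.translate(
        str.maketrans(string.punctuation, " " * len(string.punctuation))
    )
    tokens = no_punct.lower().split()
    kept = [t for t in tokens if t not in STOPWORDS]
    return [simple_stem(t) for t in kept]


word_list = preprocess("Thanking you: thanks for the updates \U0001F642")
```

With this toy stemmer, "thanking" and "thanks" both reduce to "thank", which mirrors the mapping example given in Section 3.3.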


3.4. Feature extraction 

Feature extraction involves vectorization, bag of words, TF-IDF, N-gram and word embedding 
techniques. Feature extraction maps the data into a vector space. In this phase, different hashtags that have 
a similar meaning or that signify similar events are mapped together. For example, the hashtags #chandrayan, 
#Chandrayanl, #ISRO and #IndiaFails signify the same event ‘chandrayan’ and are clubbed together. Similarly, 
tweets that have user mentions similar to the events are also clubbed together and considered as 
sub-events. 
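
The following sketch illustrates three of the ideas above: clubbing equivalent hashtags into one event, building a bag-of-words vector, and generating N-grams. The lookup table `EVENT_OF_HASHTAG` and all function names are hypothetical; how the mapping was actually built is not stated in the text, so a hand-written dictionary is assumed here.

```python
from collections import Counter

# Hypothetical lookup table; the hashtags are those named in the text.
EVENT_OF_HASHTAG = {
    "#chandrayan": "chandrayan",
    "#chandrayanl": "chandrayan",
    "#isro": "chandrayan",
    "#indiafails": "chandrayan",
}


def event_counts(hashtags):
    """Count hashtag occurrences per event after merging equivalent tags."""
    return Counter(
        EVENT_OF_HASHTAG.get(tag.lower(), tag.lower()) for tag in hashtags
    )


def bag_of_words(tokens, vocabulary):
    """Return a count vector for tokens, ordered by vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


counts = event_counts(["#ISRO", "#Chandrayan", "#IndiaFails", "#Pulwama"])
vector = bag_of_words(["moon", "surface", "moon"], ["earth", "moon", "surface"])
bigrams = ngrams(["moon", "surface", "earth"], 2)
```

Hashtags that are absent from the lookup table are kept as their own event, so unknown events are not silently dropped.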


3.5. Choosing machine learning techniques 

Machine learning approaches are chosen according to the problem definition in order to obtain more accurate 
results. As discussed above, the SVM and naive Bayes classification techniques are mainly used when 
users have some predefined rules that need to be followed. For clustering, K-means and 
fuzzy logic algorithms are mainly employed because of their simplicity and accuracy [40]. However, some 
researchers have also used semi-supervised and hybrid techniques [20], [26]. 
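
To make the naive Bayes option concrete, a minimal multinomial naive Bayes classifier with add-one smoothing is sketched below. The class name, the four training documents and their labels are hypothetical and serve only to make the example runnable; a practical system would more likely rely on an established library implementation.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)
        self.vocab = {t for c in self.term_counts.values() for t in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_terms = sum(self.term_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                if t not in self.vocab:
                    continue  # ignore terms never seen in training
                count = self.term_counts[label][t] + 1
                score += math.log(count / (total_terms + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, already pre-processed training tweets.
train_docs = [["great", "launch"], ["proud", "great"],
              ["mission", "fail"], ["sad", "fail"]]
train_labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict(["great", "mission"])
```

Scores are accumulated in log space so that products of many small probabilities do not underflow.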


3.6. Output for predictive level of sentiment analysis 

Based on the predicted polarity levels, the polarity of the text is calculated [41]. Thereafter, on the basis of this 
calculated polarity, the sentiment is fixed as negative, positive or neutral. 
The process of performing sentiment analysis and then finding the polarity is summarized in 
Figure 4. 
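
The mapping from a polarity score to a sentiment class can be sketched as below. The threshold of 0.05, the small lexicon and the averaging rule are all assumptions chosen for illustration; the text does not specify how polarity is scored or where the class boundaries lie.

```python
def sentiment_label(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to a sentiment class."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Hypothetical lexicon scores, used only to make the example runnable.
LEXICON = {"great": 1.0, "proud": 0.8, "fail": -1.0, "sad": -0.7}


def polarity_score(tokens):
    """Average the lexicon scores of the tokens; unknown tokens count as 0."""
    if not tokens:
        return 0.0
    return sum(LEXICON.get(t, 0.0) for t in tokens) / len(tokens)


samples = (["great", "launch"], ["mission", "fail"], ["moon", "surface"])
labels = [sentiment_label(polarity_score(t)) for t in samples]
```

A symmetric dead zone around zero keeps weakly polar text in the neutral class instead of forcing a positive or negative decision.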


[Flow diagram; legible box labels: Information, Selection, Processed] 

Figure 4. Method for sentiment analysis 


4. RESULT 

The results after applying each of the pre-processing techniques, namely tokenization, stopword removal 
and stemming, are shown in Table 2 and plotted against the dataset size in Figure 5. Different 
dataset sizes have been taken to test the behaviour of the pre-processing stages. It has been 
demonstrated that stopword removal and stemming are compulsory parts of pre-processing. It has also 
been shown that stopword removal reduces the dimensionality of the text considerably. From 
Figure 5, it can be concluded that the application of pre-processing techniques has a positive impact on the 
number of terms selected. The results also show that stemming makes only a small further difference to the 
number of terms selected. 


Text pre-processing of multilingual for sentiment analysis based on social network data (Neha Garg) 


782 ISSN: 2088-8708 


Table 2. Statistics for preprocessing 
Dataset (Tweets)   Tokenization   Stop-word removal   Stemming 

2,000              12,000         10,000              9,000 
5,000              30,000         22,000              20,000 
9,000              53,000         45,000              42,000 
12,000             70,000         57,000              55,000 
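
The relative effect of each stage can be worked out directly from the counts in Table 2. The short sketch below uses only the three count columns of the table; the variable and function names are illustrative.

```python
# Rows of Table 2: (tokens, after stop-word removal, after stemming).
TABLE_2 = [
    (12_000, 10_000, 9_000),
    (30_000, 22_000, 20_000),
    (53_000, 45_000, 42_000),
    (70_000, 57_000, 55_000),
]


def reduction(before, after):
    """Percentage drop from before to after, rounded to one decimal."""
    return round(100 * (before - after) / before, 1)


# One (stop-word removal %, stemming %) pair per dataset size.
reductions = [
    (reduction(tokens, after_swr), reduction(after_swr, after_stem))
    for tokens, after_swr, after_stem in TABLE_2
]
```

On these figures, stop-word removal cuts the term count by roughly 15% to 27%, while stemming removes a further 3.5% to 10% of the remaining terms.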


[Line plot: Tokenization, SWR and Stemming series plotted against size of dataset (x-axis, 2000 to 12000); y-axis ticks from 5000 to 75000] 
Figure 5. Effect of preprocessing 


5. CONCLUSION 

The present article discusses supervised and unsupervised SA techniques. In this paper, the basic 
techniques of data extraction, followed by data cleaning and data pre-processing techniques, have been 
presented. Three basic pre-processing techniques, i.e. tokenization, stopword removal and stemming, have 
been applied to a Twitter dataset. From the results, it can be concluded that pre-processing has a large 
impact on reducing the dimensionality of the data, which in turn results in better-performing and more accurate SA 
techniques. The results indicate that the stopword removal technique removes unnecessary words from the 
dataset, thereby improving accuracy. The same technique may be applied to different datasets 
belonging to different domains. The stopword list can be refined for the domain to achieve 
better accuracy. 